Tandem Anchoring: a Multiword Anchor Approach for Interactive Topic Modeling
Abstract
Interactive topic models are powerful tools for understanding large collections of text. However, existing sampling-based interactive topic modeling approaches scale poorly to large data sets. Anchor methods, which use a single word to uniquely identify a topic, offer the speed needed for interactive work but lack both a mechanism to inject prior knowledge and the intuitive semantics needed for user-facing applications. We propose using combinations of words as anchors, going beyond existing single-word anchor algorithms, an approach we call “Tandem Anchors”. We begin with a synthetic investigation of this approach, then apply it to interactive topic modeling in a user study and compare it to interactive and non-interactive approaches. Tandem anchors are faster and more intuitive than existing interactive approaches.
Topic models distill large collections of text into topics, giving a high-level summary of the thematic structure of the data without manual annotation. In addition to facilitating discovery of topical trends (Gardner et al., 2010), topic modeling is used for a wide variety of problems including document classification (Rubin et al., 2012), information retrieval (Wei and Croft, 2006), author identification (Rosen-Zvi et al., 2004), and sentiment analysis (Titov and McDonald, 2008). However, the most compelling use of topic models is to help users understand large datasets (Chuang et al., 2012).
Interactive topic modeling (Hu et al., 2014) allows non-experts to refine automatically generated topics, making topic models less of a “take it or leave it” proposition. Including human input during training improves the quality of the model and allows users to guide topics in a specific way, tailoring the model to a specific downstream task or analysis.
The downside is that interactive topic modeling is slow (algorithms typically scale with the size of the corpus) and requires non-intuitive information from the user in the form of must-link and cannot-link constraints (Andrzejewski et al., 2009). We address these shortcomings by using an interactive version of the anchor words algorithm for topic models. The anchor algorithm (Arora et al., 2013) is an alternative topic modeling algorithm which scales with the number of unique word types in the data rather than the number of documents or tokens (Section 1). This makes the anchor algorithm fast enough for interactive use, even on web-scale document collections. A drawback of the anchor method is that anchor words, which are words that have a high probability of being in a single topic, are not intuitive.
We extend the anchor algorithm to use multiple anchor words in tandem (Section 2). Tandem anchors not only improve interactive refinement, but also make the underlying anchor-based method more intuitive. For interactive topic modeling, tandem anchors produce higher quality topics than single-word anchors (Section 3). Tandem anchors provide a framework for fast interactive topic modeling: users improve and refine an existing model through multiword anchors (Section 4). Compared to existing methods such as Interactive Topic Models (Hu et al., 2014), our method is much faster.
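The text above does not spell out how several anchor words are combined into one anchor, so the following is only a minimal sketch under stated assumptions: each word is represented by its row of a row-normalized word co-occurrence matrix, and a tandem anchor is formed by combining the rows of its constituent words element-wise. The function names (`row_normalize`, `tandem_anchor`), the toy vocabulary and counts, and the choice of harmonic mean or average as the combiner are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence


def row_normalize(counts: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn each row of co-occurrence counts into a conditional distribution."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def harmonic_mean(values: Sequence[float]) -> float:
    """Harmonic mean; defined as 0 when any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def average(values: Sequence[float]) -> float:
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def tandem_anchor(
    q_bar: Sequence[Sequence[float]],
    vocab: Dict[str, int],
    words: Sequence[str],
    combine: Callable[[Sequence[float]], float] = harmonic_mean,
) -> List[float]:
    """Combine the co-occurrence rows of several words into one anchor vector.

    Assumption: the combined vector stands in for a single-word anchor row.
    """
    rows = [q_bar[vocab[w]] for w in words]
    # zip(*rows) walks the rows column by column, so each output entry
    # combines the same vocabulary dimension across all chosen words.
    return [combine(column) for column in zip(*rows)]


# Hypothetical toy data: 4 word types and their co-occurrence counts.
vocab = {"camera": 0, "lens": 1, "court": 2, "judge": 3}
counts = [
    [4, 4, 0, 0],
    [4, 2, 2, 0],
    [0, 2, 2, 4],
    [0, 0, 4, 4],
]
q_bar = row_normalize(counts)

photo_anchor = tandem_anchor(q_bar, vocab, ["camera", "lens"])
photo_anchor_avg = tandem_anchor(q_bar, vocab, ["camera", "lens"], combine=average)
```

In this sketch a user would name a topic by listing a few related words, such as "camera" and "lens", rather than hunting for one word that identifies the topic on its own. The harmonic mean keeps only the dimensions on which all chosen words agree, while the average also retains dimensions supported by just one of them.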
Publication date: 2017